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A Regularization Approach for Prediction 
of Edges and Node Features in Dynamic 

Graphs l^ 



Abstract 



^^ I We consider the two problems of predicting links in a dynamic 

graph sequence and predicting functions defined at each node of the 
graph. In many applications, the solution of one problem is useful for 
solving the other. Indeed, if these functions reflect node features, then 
tH- i they are related through the graph structure. In this paper, we for- 

P\| ' mulate a hybrid approach that simultaneously learns the structure of 

the graph and predicts the values of the node-related functions. Our 
approach is based on the optimization of a joint regularization objec- 
tive. We empirically test the benefits of the proposed method with 
both synthetic and real data. The results indicate that joint regular- 



Y^ [ ization improves prediction performance over the graph evolution and 

the node features. 

^ • 1 Introduction 

m 

T^lj- I Forecasting the behavior of systems with multiple responses has been a chal- 

•^ ■ lenging problem in the context of many applications such as collaborative 

cn . filtering, financial markets, or bioinformatics, where responses may be, re- 

^ I spectively, movie ratings, stock prices, or activity of genes within a cell. 

Statistical modeling techniques have been widely applied for learning multi- 
variate time series either in the multiple linear regression setting [3] or with 
, . , autoregressive models [H]. More recently, kernel-based regularized methods 

r^ • have been developed for multitask learning [TIE]. These approaches share 

C^ . in common the use of the correlation structure between input variables to 

enhance prediction of every single output. Frequently, the correlation struc- 
ture is assumed to be given or is estimated separately. A discrete encoding 
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of correlations between variables can be modeled as a graph so that learning 
the dependence structure amounts to performing graph inference through 
the discovery of unobserved edges on the graph. The latter problem is in- 
teresting per se and is known as link prediction, where it is assumed that 
only part of the graph is actually observed [TUl E] . This situation occurs in 
various applications such as recommender systems, social networks, or pro- 
teomics, and the appropriate tools can be found among matrix completion 
techniques [18l [5l [1] . In the realistic setup of a time-evolving graph, matrix 
completion was also used and adapted to take into account the dynamics of 
the features of the graph |14] . 

In this paper, our goal is to simultaneously predict multiple outputs de- 
fined over the vertices of a time-evolving graph and learn the structure of 
the graph. One important assumption we make is that the network effect 
is a cause and a symptom at the same time. Consider, for instance, the 
problem of forecasting sales based on purchase data from an e-commerce 
market where the semantic information is not reliable. The data can be 
represented as the (bipartite) graph of users and products connected by 
purchases which evolves over time according to the purchase history. We 
expect that within a cluster of co-purchased items, the sales should be cor- 
related, and reversely, correlated sales should induce more edges between 
corresponding items. A similar situation arises in the context of financial 
data where the graph reflects dependencies between stocks and one aims 
at both predicting stock prices and inferring the dependence structure. We 
build on the approach proposed in [6] which shows that, in the case where 
graph structure information is available about the dependencies between in- 
put variables, the efficiency of multivariate predictions can be enhanced. In 
this paper, we tackle a problem of broader scope, that is, we assume that the 
underlying graph is unknown. We propose a formulation of the problem as 
a regularized risk minimization which balances both objectives of prediction 
and matrix completion at the same time. The proposed formulation is con- 
vex with respect to each target variable (predictor and adjacency matrix) 
separately but not jointly convex. We study the performance of this joint 
optimization approach through a set of experiments on real and synthetic 
data. We mention that a byproduct of our investigations is the introduction 
of a generative model using latent factors for creating artificial data sets. 
We discuss empirical convergence, implementation issues and performance 
with respect to each of the two objectives (multivariate prediction and graph 
inference) . 

The problem of simultaneous prediction of multivariate time series with 
graph dependence structure and inference of the graph structure is presented 



in Section 2. We also present the joint regularization approach for solving 
this problem. In Section 3 we discuss the algorithm used and the compu- 
tation of relevant features within this regularization approach. Numerical 
experiments are discussed and results on synthetic and real data sets are 
presented in Section 4. 

2 Learning Dynamic Graph Features and Edges 

2.1 Setup 

We first introduce notations for the main objects of interest in the paper: 

• a sequence of T undirected weighted graphs with n vertices and their 
corresponding nx n adjacency matrices At € S>o,i G {1,2, ...,T}, 
where S" q denotes the set of n x n symmetric matrices with nonnega- 
tive entries. We assume that, the weights of the edges are nondecreas- 
ing functions of time, which in the unweighted case means edges do 
not disappear but new ones may appear over time. 

• a multivariate time series of node features (^t) ^j-<r ^° ^^^^ ^^' "'^* ^ 

We denote by X^ E W^ the ith row of Xt representing the feature vec- 
tor of node i at time t. We assume that this series depends on the series 
of graphs via some function a; : M"^" — ^ M"^'', that is, Xt = (jj{At). 

Given the values of {At,Xt), for t G {1,2, ...,r}, the main goal is to si- 
multaneously predict the future value Xt+i and the future adjacency matrix 
At+i- To this end, we introduce: 

• A descriptor matrix $t € M"^'^, t G {l,2,...,r}, which encodes the 
past information contained both in the time series {Xs)i<s<t and in 
the sequence of adjacency matrices {As)i<s<t into a row vector of 
dimension d for each node. 

• A matrix- valued prediction function / : M"^'^ — )• M"-^? which relates 
the variable Xt+i to the past information <&t so that Xt+i is close to 
/($j). The function / is unknown and has to be estimated from past 
data. We denote the i-th row component of / with the notation f^^' , 
i G {l,2,...,n}. The ?i prediction functions f^^' are assumed to belong 
to the same Hilbert space 7i, equipped with the norm || • ||-h- The norm 

of / G 7^" is defined by H/H^n := ^ElUlf^- 
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• An unknown matrix S € S"g whose elements indicate how hkely it is 
that there is a nonzero value at the corresponding position of matrix 
At+1- The most likely edges at time T + 1 are the ones corresponding 
to the largest values in 5. 

2.2 Motivation 

The challenge in the prediction problem on real graphs evolving over time 
is to relate global and slowly varying node features with local and sudden 
changes of edges. Thus it is reasonable to assume that the evolution of the 
graph is governed by unobserved latent factors [El [15], denoted by ^t, which 
evolve smoothly. For example, in social networks or marketing applications, 
latent factors could be psychological factors which account for the choices 
of consumers. The effect of such factors on the graph can then be measured 
using the adjacency matrix. For instance, the clustering coefficient captures 
homophilic behavior, the node degree is a good indicator of popularity and 
the pagerank refers to centrality and influence, the trends can be read in 
the evolution of node degrees and the tastes are believed to be reflected on 
the first singular vectors of the adjacency matrix. A possible formulation of 
the latent factor model sets ^t to be a pair of matrices {Ut,Vt) governing 
the structure of At through At = UtVt^ ^ where the slow evolution of latent 
factors is guided by some unobserved underlying mechanism. A rigorous 
framework would involve a statistical model for the noise and a specific form 
of dependence of the matrices on their recent history. A formal example 
will be provided in Section 4. Both the latent factors and their evolution 
being unobservable, we assume that some relevant and observable features 
Xt partly capture the information contained in the latent factors. 

2.3 Assumptions 

The approach studied in this paper relies on the assumption that exploit- 
ing simultaneously the structure of the graph and the dynamics of the time 
series should improve both prediction and graph learning. Given the in- 
formation collected over the period t = 1, . . . ,T, which is contained in the 
feature matrices $tj we want to learn / and S which satisfy the following 
assumptions: 

• [Al] Low rank. S has low rank. This is a standard assumption in 
matrix completion problems |18[ [5] . The rationale is that the factors 
Ut, Vt (see Sec. 2.2) should be of small dimension. 



• [A2] Graph growth regularity. The growth of the graph exhibits 
regularity over time, that is, S should be close to the last adjacency 
matrix At- We will use the Frobenius distance between these two 
matrices as an estimate of the error of the learning process with respect 
to graph inference. 

• [A3] Feature growth regularity. The features of the predicted 
graph are assumed to be close to the predicted features. As a con- 
sequence, we assume that the norm of /($t) — ^(S) should be kept 
small. 

• [A4] Stationarity. The dependence mechanism between Xj+i and 
$f can be inferred. Therefore, it is reasonable to assume stationarity 
of the joint distribution of {Xt+i, $f). 

• [A5] Regularity of the predictors over the graph. Neighboring 
vertices i and j in the graph should correspond to similar prediction 
functions /*-*' and /"^ 

2.4 Formulation of the Optimization Problem 

We now introduce a regularization based formulation which reflects each of 
the previous assumptions. To this end, we introduce additional notation. 
For any matrix M, Tr(M) denotes the trace of M and ||M||f := y^Tr (M^M) 
denotes the Frobenius norm of M. We also define ||M||* := Yl^=i^k{M) 
, the nuclear norm of a matrix M, where ak{M) denotes the fc-th largest 
singular value of M. We recall that a singular value of matrix M cor- 
responds to the square root of an eigenvalue of M^M and that the nu- 
clear norm is a convex surrogate of the rank. We also define the ma- 
trix A(/) := (11/'-*^ ~ f \\n)i ■=!• ^^ introduce a convex loss function 
i : M"^*? X M."'^'^ —7- M^ which measures prediction errors. Given the past 
history {(Xt,At) : 1 < t < T}, the proposed optimization problem is then 
to minimize over (/, S) £ Ti x S" q the following functional: 

T-l 

C{f,S) := ^^(/(cl.,),X,+0 + f 11/11?." +£{mT),u^{S)) 
t=i 

+r\\S\U + '^\\S- AtWI + ATY(5^A(/)) , (1) 

where A, r and v are positive regularization parameters. Typical choices of 
the loss function i are the square loss or the sum of hinge loss errors per 



component. Note that using different loss functions allows one to adapt the 
method to different problems, such as regression or classification. 

We now describe each of the separate terms in the above formulation. 
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The first term, Mf) = EJJi ^Ui^t),Xt+i) +f ||/||w", corresponds 
to the prediction task as in standard regularized risk minimization in 
RKHS for supervised learning. 

The second term, J2{S) = ^^\\S\\:^ + |||5 — AtIIfj with // = t/u, repre- 
sents the low-rank matrix denoising objective (static link prediction) 
and reflects assumptions Al and A2. 

The third term, Js^f, S) = i{f{^T),^{S)), penalizes the difference be- 
tween the predicted features and the features of the predicted matrix. 
This term reflects assumption A3. 

The last term, J4{f,S) = Tr(5^A(/)) relates the contributions of / 
and S based on assumption A. 



We point out the connection of work on matrix completion [U \T7[ [9] to 
the minimization of ^(/(^t), w(S')) +^||5||*. Also, in jl3], the authors use 
the minimizer / of Ji and solve the subproblem argmin_5 J2{S) + JsiS, /). 

2.5 Linear Models 

For simplicity of presentation, we now consider the case of Ti being an RKHS 
corresponding to a linear kernel. In addition, each /'*) is a linear function 
represented by a dx (7 matrix l^/'^*), so that /(*)(<I>j) = <^J VF^*). Then we may 
assume ||/''*^||-h = l|l^ I|f and the graph regularity term can be expressed 
in terms of the graph Laplacian of S. Recall that the Laplacian is defined 
as the operator A such that A(S') = D — S where D = diag((ii, . . . , dn) and 
di = Yll=i ^ij ^^^ t^^ degrees of the graph. A standard computation gives 
the following expression for J4: 

l<i,j<n 

where W = {W^^\ . . . , VF^")) G M"^''^^. We use the standard extension of 



the Frobenius norm to 3-tensors, ||VF|||' = y X^^Li II^^*^IIf ^^^ the nota- 



tions: 



l<i,jf<n 



^{W):= (\\W^^ -W^^\{] 



«J=i 



We also assume in this work that the node features are hnear functions of 
adjacency matrices : ijj{At) = AtVt for some G R"^''. Note that degree, 
inter /intra cluster degrees and projection onto specific linear subspaces are 
such features, but not clustering coefficients or statistics on path lengths. 
We later detail how we chose Vt in our experiments. 

Working with a known and static graph. This problem was considered in [6] 
showing interestingly that feature prediction can be improved when using 
graph structure in the regularized optimization formulation. In their work, 
the graph penalty (our J4) takes the form: 

Y, 5,,||w^»-Ty(^-)||i, 

l<i,j<n 

where S is fixed and we can see that an Li norm is used for this penalty 
term instead of L2 in our case. Moreover, their approach allows negative 
interactions between vertices. This indicates that there are clearly many 
different variants to explore. 

3 Learning Algorithm 

In this section we address several issues related to the optimization problem 
p.4p and discuss the learning algorithm we will use. The algorithm is a 
standard projected gradient method, which projects on the constraint set 
£ defined below. The main challenges are the nonconvexity of J4 and the 
nondifferentiability of J2 in the objective. 

3.1 Convexity 

In the experiments conducted, we have learned linear functions /^*'. In this 
case, the term Si<j .<„ <S'ij||VF^*^ — VF''*^|||' is convex with respect to one 
of the variables, if the other is fixed, but not jointly convex with respect to 
{W, S). In order to ensure convergence of an optimization algorithm to a de- 
sired minimum, we determine a domain where C is convex. We use intuition 



from what happens with a similar optimization problem in dimension 2. In- 
deed, consider the minimization of {w, s) — t- sw'^ + as^ + f3w'^ which is the 
simplified functional in the degenerate case where n = 1. The eigenvectors 
of the Hessian are positive iff w'^ < af3, so this condition defines the convex- 
ity region. By adding the quadratic terms in S and W to Q{W,A{S),W), 
let us define 

^{S,W):='^\\W\\j,+ '^\\S-ATfF 

+XQ{W,A{S),W). 

If we suppose that the entries of Z = S — At are nonnegative, then the 
minimizer of ^ is the trivial solution S = At, W = 0. In fact, 

^{S,W) = '^\\W\\l+'^\\Z\\l+ 

x(Q{W,A{AT),W) + Q{W,A{Z),Wyj >0 (2) 

and ior Z = 0,W = 0, ^(5, W) = 0. 

We prove that C is convex over a set £ around the minimizer of ^ and 
we will ensure henceforth that the descent algorithm takes place inside this 
convex domain. 

Proposition 1 The function ^ is convex in the interior of the set: 



W F < 7-^ T 

-2A(V^+1) 

Proof. We introduce the slack variable Z = S—At and isolate the quadratic 
part of *(Zo + Z,Wo + W) for some (Zq, Wq) : 



X[2Q{W,A{Z),Wo) + QiW,A{AT + Zo),W)^ 
Thanks to Cauchy-Schwarz and the basic norm property ||^i?||i? < ||j4||^||i?||^, 

Q{W,A{Z),Wo) > -\\W\\f\\A{Z)\\f\\Wo\\f. 
We have A{Z) = D — Z. We get, again by Cauchy-Schwarz 



n n 



i=l j=l 1=1 j=l 



and therefore 

||A(Z)||^ = \\D - Z\\f < \\D\\f + \\Z\\f < {V^ + 1)\\Z\\f. 
On the other hand Q{W, A{At + Zq),W) > 0, so 

R{Z, W) > '^\\Z\\l + ^\\W\\l - 2XiV^ + l)\\W\\F\\Z\\F\\Wo\\F. 
Letting z = \\Z\\f,w = \\W\\f,wo = ||Wo||f, we have 

R{Z, W) > ^z^ + ^w;2 - 2A( V^ + l)wzwo . 

ipwo '■ (z, w) I—)- l^^ + ^w'^ — 2X{^/n + l)wzwo is a quadratic form. Therefore 
R{Z, W) is always nonnegative if il^mo is positive semidefinite positive. That 
is, if z^ + K > (always true) and uk — 4A^(\/n + l) Wq > 0, and this 
completes the proof. 

3.2 Smoothing the Nuclear Norm 

For optimization purposes, we chose to replace the nonsmooth nuclear norm 
term by a smooth approximation, in the spirit of the optimization literature, 
for example [13] . 

gr^iS) = max |(5,Z) - ^\\Z\\j, \ ai{Z) < l|. 

Each of these parameterized functions is a lower bound that approaches H^H* 
as r] — )• 0^, while being differentiable. Differentiability is due to —grj being 
related to a Moreau envelope [11] and specifically to the squared distance of 
-S from the unit ball of the spectral norm. 

3.3 Gradient Evaluation 

In the case of a differentiable loss i, we explicitly compute VC{W, S) as 

dC ^~^ 

{W,S) = Y,^wKW'^t,Xt) + 



9^ 



>V 






and 



—{w, s) = Vs^{w^T, sn) + \Aiw)+ 



'{S - At) + rC/(S)Diag ("111111 (l, ^li^\ 



V{S)\ 



where S = U{S)Diag[ai{S), . . . , an{S))V{Sy is a singular value decom- 
position of S. In the case of squared loss, 



£(i^<i>t,Xt) = ^ l|xf - $fV«l 



j=i 

Therefore: 



Vsi{w<^T,sn) = ({sny'' -^^^^w'-'^)n^ 



n 



i=l 



VwKW^uXt) = ($«"($fV» -X«))^' 



j=l 
4 Numerical Experiments 

4.1 Description of Data Sets 
4.1.1 Synthetic Data 

We introduce a generative model for the graph sequence and the node-related 

features: 

r Xt = AtQ 

At = UtVt" + zt 

where ut^i,vt^i are multivariate Gaussian vectors J\f{0,6'^Ir) in M'', and the 
entries of Zt are independently drawn from a centered Gaussian with variance 
a^. We also set 

ll^^^l II 

~~:^ — 



h{x) = el e "^ (x — f 1) -|- e "^2 (3; _ ^2' 

where vectors v\^V2 £ 1^'+ are chosen randomly and e, o"i, o"2 are positive con- 
stants. We use here n = 100, r = 4, T = 60 and the entries of Uq, Vq G M"-^^ 
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are drawn according to a uniform distribution in (0, 1). Note that such data 
by construction fulfill the required assumptions, and that the nonlinearity 
of the smooth function h is what makes the contribution of the Laplacian 
term J4^{f,S) non-trivial, we highlight this phenomenon in Figured) 



Graph Completion 



Regression 





Figure 1: Prediction accuracy as a function of A (predictors' regularity on 
the graph) and e (nonlinearity of latent factors growth) . 



4.1.2 Marketing Data 

We use data from several months of book purchase history of a major e- 
commerce website to evaluate our method. We tested our method on the 
task of predicting the cross-selling and the sales volumes on two types of 
data sets : (1) the top-sold items and (2) at an aggregated level among 
different product categories. 

1. Best-sellers For our test we selected a set of 300 top-sold books, and 
we aimed at predicting the future sales of those books, and also predict- 
ing the cross-sales, or co-purchases. We consider the set of products 
being the vertices of the dynamic graph sequence, and construct the 
'co-purchase' graph as follows. At time t one can represent the state 
of market by a binary ^users x litems matrix Mt where entry {i,j) 
is non-zero if user i has purchased item j so far. We define the co- 
purchase matrix At = M^ Mt as the adjacency matrix of the weighted 
graph. The co-purchase graph consists of the products as nodes and 
the weights of edges are given by the number of co-purchases of two 
products. We use 32 weeks (8 months) of observation for learning, 
and predict the evolution of sales volumes of the books over the next 5 
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weeks. The target feature, namely the sales volume is obtained using 
the bipartite graph of users and products. The degrees of items in the 
bipartite graph represent the sales volumes. Our algorithm outputs 
simultaneousely predictions of the top-selling items and a matrix S 
predicting the cross-sellings. 

2. Categories Predicting the sales and co-sales at the aggregated level 
of product categories is relevant for marketers and supply-chain man- 
agers. We collected data from n = 195 book categories over a history 
of T = 74 week periods. The entry {i,j) of At counts the number of 
users having purchased items of category i at time t — 1 who purchased 
at least an item from category j at time t. The features of interest are 
the weekly sales of each category and the number of new users in each 
category. 

4.2 Features and Descriptors 

We recall that our setup consists of a regression-type model for the prediction 
of (Xt+i,At+i) given the available information at time T which includes 
the previous adjacency matrices Ai, . . . ,At of size n x n and the node- 
related feature matrices Xi, . . . , Xt of size n x q. Our approach involves a 
^T of size nx d called descriptor matrix which encompasses the information 
contained in the past realizations (Xt,At) for t <T. 

4.2.1 Node Features 

The feature matrix Xt contains two types of node features: 

1. Features whose predicted values are of direct interest. For instance, 
the volume of sales or popularity of an item, represented by the degree 
of the related vertex in the purchase graph, or the growth rates of the 
degree as an indicator of the penetration level [T7] of the item in the 
market. 

2. Features which are believed to measure relevant underlying factors 
that govern graph structure. For instance the clustering coefficient 
[20j taken as a proxy for homophily is not of direct interest, but is a 
useful quantity for inferring the future connections of a node. 

For simplicity, we set here Xt = At^^ where is an n x g matrix. We assume 
that the columns of fi contain (i) the constant vector 1„ (since Atln is the 
vector of node degrees), (ii) the indicator vectors of several clusters which 
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we identified in the graph and (iii) the top k = 5 eigenvectors of At which 
are beheved to capture most of the structure of Af. 

4.2.2 Node Descriptors 

The descriptors contain the features and also other quantities that are useful 
for predicting the features. In the present work, we made the following 
choice: 

$t = {Atn, {At - At^i)n, {At - 2At.i + At-2)^) 

= {Xt,Xt — Xt-i,Xt — 2Xt-i + Xt-2) 

which is a matrix of size n x 3q representing features, their velocity 
and acceleration rates. We remark that time-related statistics of the recent 
history of {Xt)t<T (such as moving averages, residual variance), or other 
quantities accounting for alternative representations of the time series (such 
as Fourier or wavelet coefficients, polynomial approximation) could also be 
included. 

4.3 Evaluation Metrics 

Various prediction tasks could be studied in our setup with specific criteria 

for optimization and evaluation. We focus here on regression (for feature 

prediction) and graph completion (for link prediction), but classification of 

vertices may also be a useful task. 

Regression. When the effective value of an asset in the future is of interest, a 

squared error metric is appropriate, leading to the use of ||-^t+i — /(^t)||f 

for evalution and l{f{^t),Xt) = \\Xt+i - f{^t)\\F in the objective. The 

reported values are the relative errors - — Vy ■'\'^'"^ , 

Graph Completion. Our method aims simultaneousely at the prediction 

of Xt+1 and At+i- We measure the quality of prediction of At+i by 

\\S — At-i-i\\f, and the relative values reported are n^"*"^ i, . 

II J- -TJ-ll-i r- ||At+i||f 

Classification. Specific patterns may appear in the time series Xt which we 
may wish to predict. For instance, predicting top-selling items in a given 
market is definitely of interest. This problem can be formulated as follows: 
the matrix Xt represents the sales volumes over a market (each component 
corresponds to a product), and we assign a ±1 label depending on the order 
of magnitude of the increase of sales volumes over a specific time window. 
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4.4 Comparison with Related Methods 

Since the problem which we address (simultaneous graph completion and 
feature prediction) is a novel one, there are no direct competing methods to 
compare with. Thus, in our experiments we have considered the following 
baselines which address the most similar learning problems: 

1. Ridge Regression. Learning the function / := argmin^ Ji through the 
minimization of the penalized regression Ji(/) = '^t=i ^{f{^t), ^i+i)+tll/|||^i 
leads us to a prediction function / which we use for predicting Xt+i 

by f{^T)- This approach addresses the prediction of node features 
as independent time series. It does not take the graph sequence into 
account and does not predict the future graph. 

2. Rank- free prediction. We used our formulation with the choice of 
r = 0. In this case, the loss function ignores the low-rank hypoth- 
esis in the graph structure. The prediction process only relies on the 
dynamics of graph features. 

3. Graph completion through shrinkage. We also considered a method 
only dedicated to link prediction and matrix denoising, ignoring the 
feature prediction part of the problem. This method only uses the 
observation At as a noisy version of At+i, and aims at predicting 
At+i given only this observation and the low rank hypothesis. We 
solve the problem vahis J2{S) where J2{S) = \\\S — At\\\ + Af||5'||* 
to estimate the adjacency matrix as a low rank approximation of At. 
The solution of this problem is obtained by singular value shrinkage 
[3], namely if At = C/diag (cri(Ar), . . . ,(T„(^t'))^^ is the singular 
value decomposition of At, then the image of At by a shrinkage of 
parameter // is given by D^{At) = C/diag (o"j(At') — /i)_|_y^ . 

4.5 Results 

We implemented a first-order gradient descent algorithm with projections on 
the convex set £ at each iteration. In Figure [2] we show how cross validation 
can be done first on the two standard objectives of regression and graph com- 
pletion (matrix denoising) taken separately for setting the values of /x = ^ 
and AC efficiently. This reduces the number of smoothing parameters, and 
thanks to experiments on synthetic data we studied the dependence of opti- 
mal regularization parameters k, r, i/, A on the model parameters r, o", e. We 
illustrate in Figure [3] the efficiency of our algorithm for minimizing different 
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arginin3n||S||..||S-A^||; 




Figure 2: Cross-validation for /i and k on the two standard objectives of 
regression and graph completion (matrix denoising) taken separately (top 2 
plots) and dependence of constants chosen by cross-validation //cVi'^cv on 
generative model constants r, a. 



objective terms despite competing effects at the starting point of optimiza- 
tion. We emphasize the decrease of validation error as a function of the 
number of iterations. Finally, in Figure [H we report results of experiments 
on marketing data sets, which show that, with appropriately chosen hyper- 
parameters, our approach outperforms standard competing baselines. We 
also observe that on the top-selling items data set, the nuclear norm term 
does not add to prediction accuracy compared to the rank-free r = case. 
Further study is needed to better understand whether this effect is due to 
the possible presense of high noise or to the higher efficiency of predicted 
features in predicting the future graph. 
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Quantities to Control 




J,=E,l(f(*,),X,^,).K||f| 
.J3=l(f(*^),<a<S)) 
.I|A,-S||' 
.l|S|[. 
_J_j=<s,a{f)> 



V- " t.' ji J> i^!!L».«.«;niiiii»«T! 



iteration 
Validation Error (HYBRID OPTIIVIIZATION) 









- 


■ 
1 *, 


- ■ - Regression 

- • - Graph Completion 


- 






- 





Figure 3: Numerical behavior of different terms of the risk functional and 
validation error versus iteration count of a gradient descent algorithm (syn- 
thetic data). 



5 Conclusion 

We have studied the benefits of joint regularization for simultaneously pre- 
dicting time series which are related, as well as the graph representing these 
relations. The main goal has been to investigate the feasibility as well as 
the advantages of such a simultaneous estimation procedure. Although the 
corresponding optimization problem is not convex, we identified a convexity 
domain which in some cases will contain the optimal solution. The experi- 
mental results indicate that exploring such joint regularization and learning 
problems can greatly improve the predictive performance of these methods. 
Clearly, an important question is to determine conditions under which the 
global optimum is attained within the convexity domain of the optimization 
problem (12. 4p . A key contribution of this work is to show the potential of this 
hybrid approach, as well as to propose a practical algorithm for solving the 
learning problem. Moreover, given the promising empirical results, future 
improvements in the optimization methodology may lead to even further 
improvements in terms of predictive performance. 
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Figure 4: Performance on real data for different values of i', where we have 
set r = z^/icv for a value of /^cv fixed in advance by cros- validation. Book 
categories sales volumes and cross-selling on the top and best-selling items 
on the bottom. 
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